Update CFGGuide to use outlines.fsm.parsing. Enable generate.cfg #1067

Conversation
Please provide comments for these issues. I will create the New Issues once they're refined / approved. New Issues: Direct
@@ -614,6 +652,8 @@ def __init__(self, conf: "LexerConf", states, always_accept=()):
             lexer_conf.terminals = [
                 terminals_by_name[n] for n in accepts if n in terminals_by_name
             ]
+            if not lexer_conf.terminals:
+                continue
Note: Bug-fix for the case where lexer_conf.terminals is empty (this happens when EOS is the only legal next terminal).
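For context, here's a minimal illustration with plain lark of how the accepted-terminal set can come back empty (toy grammar of my own, not the PR's code): once only end-of-input is legal, accepts() reports the synthetic $END terminal, which never appears in terminals_by_name, so the filtered list is empty.

from lark import Lark

parser = Lark('start: "a"', parser="lalr")
interactive = parser.parse_interactive("a")
interactive.exhaust_lexer()  # consume the single "a" token

accepts = interactive.accepts()
print(accepts)  # {'$END'}: EOS is the only legal next terminal

terminals_by_name = {t.name: t for t in parser.terminals}
print([terminals_by_name[n] for n in accepts if n in terminals_by_name])  # []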
                token_history=lexer_state.last_token and [lexer_state.last_token],
                state=parser_state,
                terminals_by_name=self.root_lexer.terminals,
            )
Note: Fixes the following tests in test_cfg_guide.py::test_cfg_next_token:
- Multiple Valid Continuations
- Token is Substring of Another Token
- Recursive Patterns
We need a more general and persistent warning that explains that the CFG implementation is (and has always been) experimental community-contributed code. We may even need to clarify that it does not reflect the approach described in our technical report, aside from its use of incremental/partial parsing.
Thanks, I'll update.
I have added rejection sampling. It checks each token for acceptance, starting with the highest logprob, completing once one sample is accepted. This is effectively greedy sampling. This behavior is documented in . It is used by default in .
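For reference, a minimal sketch of the rejection-sampling loop described above (the helper name and acceptance callback are illustrative, not the PR's actual API):

import torch

def rejection_sample(logits: torch.Tensor, is_accepted) -> int:
    """Try candidates in descending-logprob order and keep the first one
    the guide accepts; effectively greedy sampling over valid tokens."""
    for token_id in torch.argsort(logits, descending=True).tolist():
        if is_accepted(token_id):  # e.g. a CFGGuide acceptance check
            return token_id
    raise RuntimeError("no vocabulary token is valid under the grammar")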
Benchmarks

Benchmarks aren't a strong indicator though: the performance improvement is entirely dependent on the fraction of tokens which are valid under the grammar's production rules at each state used in sampling.
        # normalize
        if state.prev_token is None:
            new_token_str = self.tokenizer.decode([token_id])[0]
        else:
            prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]
            combined_token_str = self.tokenizer.decode([[state.prev_token, token_id]])[0]
            new_token_str = combined_token_str[len(prev_token_str) :]
Note: This token normalization step, which determines a token's decoded value given the previous token, is necessary for correctness of RegexFSM as well. Should be a separate issue.
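To illustrate the issue (a sketch using the Hugging Face gpt2 tokenizer; the tokenizer choice is incidental): a token's surface string can depend on its neighbor, so decoding it in isolation may not equal its contribution in context.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
prev_id, token_id = tok.encode(" hello world")  # two token ids

alone = tok.decode([token_id])                       # decoded in isolation
prev_str = tok.decode([prev_id])
in_context = tok.decode([prev_id, token_id])[len(prev_str):]  # given prev

# Byte-level BPE usually agrees here, but SentencePiece-style tokenizers
# drop the leading space when a token is decoded alone, so the two strings
# can differ; the normalization above feeds the parser the in-context text.
print(repr(alone), repr(in_context))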
As I've signaled here on 28 April in #788 (comment), I've been working on a parser recently which I would say is around 95% complete. It's built from scratch and dependency-free (excluding tests), and it has its own internals for guiding a CFG; it's designed solely for that purpose. @rlouf @brandonwillard @lapp0 Should we have a look at it when it's finished? Should we discuss it in the Discord server? The Discord link in GitHub is broken BTW; it would be nice if someone could fix it.
Very interesting. Yes, we can discuss here or on Discord. Here's a temporary invite while we sort out the invite situation: https://discord.gg/H7pEAMPZ

Edit: I tried all three Discord links on GitHub; they all work for me. Which one is broken?
@lapp0 It seems like the problem came from my Discord. I was referring to the link in the "Join us" section; it does work. Thanks.
Rendered Docs
Fixes:
Changes
CFGGuide
- Rewrite CFGGuide based on @brandonwillard's implementation in examples/parsing.py
- Update outlines.fsm.parsing to handle some edge cases
- Use accepts() and feed_eof() for termination checking (see the sketch after this list)
- Remove CFGFSM and CFGFSM tests
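For context, a sketch of what accepts()/feed_eof()-based termination checking looks like with plain lark (toy grammar; the PR's actual logic lives in outlines.fsm.parsing):

from lark import Lark
from lark.exceptions import UnexpectedInput

parser = Lark('start: "a"+', parser="lalr")
interactive = parser.parse_interactive("aa")
interactive.exhaust_lexer()

# Cheap check: is end-of-input among the currently accepted terminals?
can_terminate = "$END" in interactive.accepts()

# Hard check: feeding EOF on a copy must complete a parse without error.
try:
    interactive.copy().feed_eof()
except UnexpectedInput:
    can_terminate = False

print(can_terminate)  # True: "aa" is a complete sentence of this grammar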
Grammars
- Fix ESCAPED_STRING in json.lark and common.lark
Integrations
- Enable outlines.generate.cfg(...) via SequenceGeneratorAdapter (usage sketch below)
- outlines.processors.CFGLogitsProcessor
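A short usage sketch of the re-enabled entry point (assuming the outlines 0.x API current at the time of this PR; model and grammar choices are illustrative):

import outlines
from outlines import grammars

model = outlines.models.transformers("gpt2")
generator = outlines.generate.cfg(model, grammars.arithmetic)

# CFGLogitsProcessor masks grammar-invalid tokens at each step, so the
# output is guaranteed to parse under the arithmetic grammar.
print(generator("Write an expression:", max_tokens=30))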
Testing
- tests/fsm/test_cfg_guide.py
  - test_cfg_next_token: assert that, given a sequence of previously generated tokens, the expected next tokens in the vocabulary are allowed
  - test_cfg_grammar_sample: provided a sample valid under a grammar, let token_ids = tokenizer.encode(sample) and assert that token_ids can be produced by CFGGuide. Allows a new test to be created by simply adding an example to tests/cfg_samples/ (a condensed sketch follows this list)
- Test outlines.generate.cfg via tests/generate/test_generate.py
- Update tests/fsm/test_guide.py to test for CFGGuide.must_terminate_state() and CFGGuide.can_terminate_state()
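A condensed sketch of the test_cfg_grammar_sample logic (the helper and the tokenizer-indexing details are illustrative of the outlines tokenizer interface, not the test's exact code):

def assert_sample_reachable(guide, tokenizer, sample: str):
    """Walk the guide along one tokenization of a grammar-valid sample,
    asserting every token is allowed and the final state may terminate."""
    state = guide.initial_state
    token_ids, _ = tokenizer.encode(sample)  # shape (1, n) tensor
    for token_id in token_ids[0].tolist():
        instruction = guide.get_next_instruction(state)
        assert token_id in set(instruction.tokens), f"{token_id} rejected"
        state = guide.get_next_state(state, token_id)
    assert guide.can_terminate_state(state)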
Benchmarks
- benchmarks/bench_cfg_guide.py: measure CFGGuide construction time, token run time, and token run peak memory (an illustrative sketch follows)
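An ASV-style sketch of what such a benchmark can look like (class layout and setup details are illustrative, not the file's exact contents):

from transformers import AutoTokenizer

from outlines.fsm.guide import CFGGuide
from outlines.grammars import json as json_grammar
from outlines.models.transformers import TransformerTokenizer

class CFGGuideBenchmark:
    def setup(self):
        self.tokenizer = TransformerTokenizer(AutoTokenizer.from_pretrained("gpt2"))
        self.guide = CFGGuide(json_grammar, self.tokenizer)

    def time_cfg_guide_setup(self):
        CFGGuide(json_grammar, self.tokenizer)  # construction cost

    def time_cfg_guide_run(self):
        # one instruction/state step approximates generating one token
        state = self.guide.initial_state
        instruction = self.guide.get_next_instruction(state)
        self.guide.get_next_state(state, instruction.tokens[0])

    def peakmem_cfg_guide_run(self):
        self.time_cfg_guide_run()  # ASV reports peak memory of this call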
Analysis

Using the gpt2 tokenizer: regardless of sequence length (10 tokens, 40 tokens, or 100 tokens), it takes ~1.2 seconds to generate a token.

Unsurprisingly, get_next_instruction takes most of the time, totaling over 99.99% of the runtime. This is intuitive considering the same operation is applied for get_next_state, but for a single token instead of once for each of gpt2's 50,257 tokens (see the schematic sketch below).
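Schematically (hypothetical helper names, not the real method bodies), the asymmetry looks like this:

def get_next_instruction(guide, state, vocabulary):
    # the same accept/reject check, repeated once per vocabulary entry
    # (~50,257 iterations for gpt2)
    return [t for t in vocabulary if guide.accepts(state, t)]

def get_next_state(guide, state, token_id):
    # the identical check, performed once for the sampled token
    return guide.advance(state, token_id)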
Breakdown:

cProfile:
TODO